1. Motivation:

The flights are frequently used as transportation system due to the size of land in US. If you ever traveled across the country, then you might concern the flight delay if you have any important meeting, business work, or if you need to catch the next connecting flight, or so. Therefore, investigating the flight's on-time performance by time frame, carriers, or reasons seems to be reasonable in order to resolve several questions. The following exploratory questions are answered throughout this report:

  1. What is the top reason for delay in overall across carriers?
  2. Which flight carrier has the most frequent delay by year?
  3. Is there any same pattern/trend for the flight delay by carrier in each cause?
  4. Is the flight delay related to the flight distance?

2. Data Source:

  1. From Bureau of Transportation Statistics website: http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
  2. Format: .csv files
  3. Size: 3.74 GB (61 files)
  4. Dataset periods: 2011. Sep - 2016. Sep
  5. Main variables: Year(INT), Quarter(INT), Carrier name(TEXT), Departure delay minutes(INT), Departure delay group(Factor), Distance(INT), 5 causes to delay flights(INT)

3. Methods and Analysis Results

Since the dataset is huge (total 30,302,072 records in the original dataset), I think it's better to import the dataset as DB which was generated and filtered by using Python and extract necessary columns for each questions using SQL. While cleaning the dataset in Python, the records are only considered if there is at least one delay cause and if delay minutes are greater than or equal to zero. This is done for the overall dataset. The cleaned dataset (total 8,814,394 records) and its summary statistics look like the below.

Import data from DB into R.

##    obs_num year quarter carrier dep_delay dep_delay_group distance
## 1        2 2011       3      AA        10               0     2475
## 2        3 2011       3      AA        52               3     2475
## 3        5 2011       3      AA        79               5     2475
## 4        7 2011       3      AA       214              12     2475
## 5        8 2011       3      AA        42               2     2475
## 6        9 2011       3      AA        87               5     2475
## 7       10 2011       3      AA       383              12     2475
## 8       11 2011       3      AA        58               3     2475
## 9       14 2011       3      AA       304              12     3784
## 10      15 2011       3      AA        75               5     3784
## 11      16 2011       3      AA         8               0     3784
## 12      17 2011       3      AA        95               6     3784
## 13      18 2011       3      AA        30               2     3784
## 14      22 2011       3      AA       308              12     3711
## 15      25 2011       3      AA        10               0     3711
## 16      26 2011       3      AA        79               5     3711
## 17      27 2011       3      AA        74               4     3711
## 18      28 2011       3      AA        45               3     3711
## 19      29 2011       3      AA       244              12     3711
## 20      30 2011       3      AA        51               3     3711
##    distance_group delay_reason
## 1              10            1
## 2              10            1
## 3              10            1
## 4              10            1
## 5              10            1
## 6              10            1
## 7              10            1
## 8              10            1
## 9              11            1
## 10             11            1
## 11             11            1
## 12             11            1
## 13             11            1
## 14             11            1
## 15             11            1
## 16             11            1
## 17             11            1
## 18             11            1
## 19             11            1
## 20             11            1
##    year         quarter        carrier          dep_delay      
##  2011: 413456   1:2061974   WN     :2132084   Min.   :   0.00  
##  2012:1563476   2:2305939   EV     :1059558   1st Qu.:  21.00  
##  2013:2019226   3:2429545   AA     : 973360   Median :  39.00  
##  2014:2023269   4:2016936   DL     : 828152   Mean   :  58.05  
##  2015:1664532               OO     : 798608   3rd Qu.:  73.00  
##  2016:1130435               UA     : 796603   Max.   :2402.00  
##                             (Other):2226029                    
##  dep_delay_group      distance      distance_group    delay_reason
##  1      :2010312   Min.   :  24.0   2      :2203340   1:2899782   
##  2      :1607225   1st Qu.: 368.0   3      :1751157   2: 304832   
##  0      :1327665   Median : 641.0   4      :1361047   3:2569572   
##  3      :1043068   Mean   : 797.1   1      :1124083   4:  18174   
##  4      : 700937   3rd Qu.:1031.0   5      : 930797   5:3022034   
##  5      : 496332   Max.   :4983.0   7      : 399416               
##  (Other):1628855                    (Other):1044554

Q1. What is the top reason for delay in overall across carriers?

3.(a) how did you manipulate the data to prepare it for analysis?

  • General manipulation is done via python. This data is extracted from DB using SQL. The first 10 records are shown like the below.

  • obs_num is the unique observation numbers and it is used since some records can have multiple resons for the delay.

##         obs_num delay_reason carrier
## 3204615       1            3      AA
## 1             2            1      AA
## 3204616       2            3      AA
## 2             3            1      AA
## 3204617       4            3      AA
## 3             5            1      AA
## 3204618       5            3      AA
## 3204619       6            3      AA
## 4             7            1      AA
## 3204620       7            3      AA

3.(b) How did you handle missing, incomplete, or noisy data?

  • Missing or incomplete data is removed in python.

  • The record is kept when there is at least one cause of delay and the delay minutes is greater than or equal to zero.

3.(c) How did you perform data analysis in code?

  • In order to count and calculate percentage of each cause of delay, ddply and ggplot is used. The result is shown below.

3.(d) What challenges did you encounter and how did you solve them?

  • For Q1, there was no challenge in R. However, I struggled with cleaning the original dataset when DB is created in Python. Especially, it took a bit of time to figure out a fast way to combine and make a new column consiting of delay reasons. Since the delay reasons are splitted into different columns for each, I should make them into a column, while specifying what reason was. This step took the most of time and I finally came up with the solution by googling and referring lecture materials from SI 601 and SI 618.

4.(a) Analysis result

  • During 2011. Sep - 2016. Sep, the top three reasons for the delay were late aircraft, carrier and national air system, taken by 34.3%, 32.9% and 29.2%, respectively.

4.(b) Additional visualization

  • I think that the question is simple and this histogram itself shows the insight already. Therefore, just check how this pattern appears in each carrier and displayed only two carriers.

Q2. Compare the count delay minutes is greater than or equal to 90 (mins) by carrier and Year. Which airlines have the most frequent delay that is more than 90 minutes?

3.(a) how did you manipulate the data to prepare it for analysis?

  • Since DB contains duplicate rows due to multiple reasons of delay for a record, non-duplicate information containing obs num, year, quarter, carrier, and delarture delay group are extracted using SQL. The data records are 952980 and it is shown the below.
##    obs_num year quarter carrier dep_delay_group
## 1        7 2011       3      AA              12
## 2       10 2011       3      AA              12
## 3       14 2011       3      AA              12
## 4       17 2011       3      AA               6
## 5       22 2011       3      AA              12
## 6       29 2011       3      AA              12
## 7       37 2011       3      AA               8
## 8       44 2011       3      AA               7
## 9       47 2011       3      AA               7
## 10      55 2011       3      AA               6

3.(b) How did you handle missing, incomplete, or noisy data?

  • Using SQL, the duplicate rows for those five columns are omitted while extracting them from DB adding DISTINCT keyword.

  • In order to compare in the same period (1 year), the plot is drawn with only year 2012 - 2016.

3.(c) How did you perform data analysis in code?

  • Group by year and carrier, count the frequency for the cases that delayed time is greater than or equal to 90 mintues across 2012-2016, using ddply.

  • Since the carrier is coded as abbreviation, I created a carrier table consisting of abbreviation and its corresponding fulll name from look-up table that downloaded from the same web source. In order to remove essential part of the name, regular expression is used. After filtering and matching, the data looks like the below.

##    carrier year delgp6_count               FullName
## 1       9E 2013        10123      Endeavor Air Inc.
## 2       9E 2013        10123 Pinnacle Airlines Inc.
## 4       AA 2015        22487 American Airlines Inc.
## 5       AA 2013        18212 American Airlines Inc.
## 6       AA 2012        16905 American Airlines Inc.
## 7       AA 2014        18955 American Airlines Inc.
## 8       AA 2016        24901 American Airlines Inc.
## 10      AS 2013         2003   Alaska Airlines Inc.
## 11      AS 2012         2509   Alaska Airlines Inc.
## 12      AS 2014         2409   Alaska Airlines Inc.
  • Using ggplot, bar plots by carrier across the time frame was plotted.

4.(a) Analysis result

  • It seems to be some distinguish between airlines. Some airlines have high frequency relative to others and the frequency tend to be stable or fluctuated. In order to answer Q2, the additional visualization would be more appropriate. That is, cutting the frequecy and see which airlines have more frequency when delay minutes is greater than or equal to 90 mins.

4.(b) Additional visualization

  • In order to see which airlines have the top 7 frequency for the flight delay minutes is greater than or equal to 90, draw only for the airlines if their frequency hits 20000 at least once. Since 2012 to 2014, Atlantic Southeast and ExpressJet Airlines have relatively high frequency compared with others, but the trend has decreased after 2015. But, be careful to interpret this result in 2016 since 2016 contains from Jan to Sep.

Reference : http://stackoverflow.com/questions/17368223/ggplot2-multi-group-histogram-with-in-group-proportions-rather-than-frequency

3.(d) What challenges did you encounter and how did you solve them?

  • I tried to use the way with qplot that is used in HW2. However, it took lots of time and the R was frozen while running. I guessed it is because internal processing took more time and consumed more memory than ggplot. Therefore, I explored another method to draw the plot with ggplot.

Q3. For each carrier, how the frequency by delay reason has been changed over time (quarter)?

3.(a) how did you manipulate the data to prepare it for analysis?

  • Including delay_reason, extract obs_num, year, quarter and carrier from DB. The data looks like the below.
##    obs_num year quarter carrier delay_reason
## 1        2 2011       3      AA            1
## 2        3 2011       3      AA            1
## 3        5 2011       3      AA            1
## 4        7 2011       3      AA            1
## 5        8 2011       3      AA            1
## 6        9 2011       3      AA            1
## 7       10 2011       3      AA            1
## 8       11 2011       3      AA            1
## 9       14 2011       3      AA            1
## 10      15 2011       3      AA            1

3.(b) How did you handle missing, incomplete, or noisy data?

  • In Q3, I didn't need to control missing, incomplete or noisy data, because DB was already cleaned in Python.

3.(c) How did you perform data analysis in code?

  • Using ddply, calculate the frequency for each carrier by quarter with respect to delay reason.

3.(d) What challenges did you encounter and how did you solve them?

  • In general, using mfrow allows for the users to plot multiple figures in a page. However, I have never used this kind of setting for ggplot. By googling, I found out how to draw multiple ggplots in a page. But I did not figure out how to use for loop to implement marrangeGrob function in gridExtra package. That's why I increased height and width option in Rmarkdown in order to fit the whole plots in a setting.

- Reference:

4.(a) Analysis result

  • The plots show quite interesting results. Generally, Weather or Security are minor reasons to cause delay regardless of airelines. The rest of three causes are generally major cause: Carrier, Late Aircraft, National Air System. For those major causes, the plots show increasing, stable, or decreasing trends for the number of occurence for flight delays in each reasons. Interestingly, American Airlines and Virgin America (They are major airlines!) have a steep upward trend relative to others. However, AirTran Airways Airlines, American Eagle Airlines, Envoy Air and Simmons Airlines have decreasing trend in terms of frequency. However, I think this trend should be weighted by the total number of flights in each carrier. Otherwise, it is possible to mislead people like the previous interpretation. Becuause, the upward trending may be caused due to the increase the number of flights and therefore, the frequency can be proportional. In order to solve this issue, the original data should include all the records. However, due to the lack of memory in the computer, it seems to be hard to gather all the data even if I use python to generate DB. I think if there is information to calculate the totla number of flights by carreir in each time frame, then it would be helpful to give more reliable plots.

4.(b) Additional visualization

  • For Virgin America airlines, plot only after 2013, quarter 3 in order to see whether their it shows upwarding trend in that time frame. As I mentioned in 4.(a), it has a big standard error interval along the line. It supports this trend may be not reliable, since it does not considered the overall number of flights run by Virgin America.